Predicting the Price of Airbnb Stays in Copenhagen

Introduction

Copenhagen, the happiest city in the world, Who doesn’t want to go there! Happiness comes at a price though, you’ll need to find a place to stay. What better way to look for a place to stay than on AirBnB and us being students, would like to know what this will cost us. This is where the MAM program from LBS is already paying itself off. To find out what variables (e.g. room type, property type) influence the price most so we can predict the cost with a regression analysis. So our dream holiday can become reality.

The data we will be using for this comes directly from AirBnB. In this analysis we will only be using data on the city of Copenhagen. the data shows the listings as per June 26th 2020.

This analysis is divided into 4 parts:

  1. Executive summary
  2. Exploratory data analysis
  3. Regression analysis
  4. Conclusion

Executive summary

Using AirBnB data for listings at the time of June 26th 2020, we predicted the costs of a stay in Copenhagen. The requirements were 4 nights with 2 persons, a private room and a minimum review score of 90 with at least 10 reviews. Our method was to run and exploratory data analysis to set the scene for the data that we would use to make our prediction. Once we had confirmed the important values that could have a significant influence on the price. we identified the following variables:

  • property type
  • number of reviews
  • review score
  • room type
  • bathrooms
  • bedrooms
  • accommodates
  • neighborhood
  • cancellation policy
  • Superhost

After running regression models with these variables we came to a model that had the most prediction power, model 7. With the specified criteria we made a prediction with 95% confidence that the cost for a median listing on Airbnb in Copenhagen is between 1,860.18 DKK and 1,938.58 DKK.

Exploratory data analysis

To get an idea of the data structure, an exploratory data analysis is conducted. The first steps are to look at the raw values and to clean the data afterwards. This gives a foundation to use in summary statistics. Finally, visualisations will give an impression of any correlations in variables and will give an outlook on variables that stand out.

Looking at the raw values

First of all, the data needs to be loaded. The clean_names() function is used to make the values consistent and neat, to help with the data wrangling.

listings <- vroom("http://data.insideairbnb.com/denmark/hovedstaden/copenhagen/2020-06-26/data/listings.csv.gz") %>%
  
    # Get the listings for Copenhagen
    clean_names() 
## Rows: 28,523
## Columns: 106
## Delimiter: ","
## chr  [45]: listing_url, name, summary, space, description, experiences_offered, neighborho...
## dbl  [40]: id, scrape_id, host_id, host_listings_count, host_total_listings_count, zipcode...
## lgl  [16]: thumbnail_url, medium_url, xl_picture_url, host_is_superhost, host_has_profile_...
## date [ 5]: last_scraped, host_since, calendar_last_scraped, first_review, last_review
## 
## Use `spec()` to retrieve the guessed column specification
## Pass a specification to the `col_types` argument to quiet this message

Having looked at the data, the following variables will be of importance throughout this analysis:

  1. price : cost per night

  2. cleaning_fee: cleaning fee

  3. extra_people: charge for having more than 1 person

  4. property_type: type of accommodation (House, Apartment, etc.)

  5. room_type: Entire home/apt (guests have entire place to themselves) Private room (Guests have private room to sleep, all other rooms shared) Shared room (Guests sleep in room shared with others) number_of_reviews: Total number of reviews for the listing

  6. review_scores_rating: Average review score (0 - 100)

  7. longitude , latitude: geographical coordinates to help us locate the listing

  8. neighborhood*: three variables on a few major neighborhoods in each city

The next step is to check the data for the properties of the variables and if there are any missing or NA values that need to be addressed.

## Rows: 28,523
## Columns: 106
## $ id                                           <dbl> 6983, 26057, 26473, 2911…
## $ listing_url                                  <chr> "https://www.airbnb.com/…
## $ scrape_id                                    <dbl> 2.020063e+13, 2.020063e+…
## $ last_scraped                                 <date> 2020-06-28, 2020-06-28,…
## $ name                                         <chr> "Copenhagen 'N Livin'", …
## $ summary                                      <chr> "Lovely apartment locate…
## $ space                                        <chr> "Beautiful and cosy apar…
## $ description                                  <chr> "Lovely apartment locate…
## $ experiences_offered                          <chr> "none", "none", "none", …
## $ neighborhood_overview                        <chr> "Nice bars and cozy cafe…
## $ notes                                        <chr> NA, NA, NA, NA, "Please …
## $ transit                                      <chr> "Bus 66 runs to the cent…
## $ access                                       <chr> "Bedroom, living room, k…
## $ interaction                                  <chr> "We are usually at work …
## $ house_rules                                  <chr> "No smoking allowed! No …
## $ thumbnail_url                                <lgl> NA, NA, NA, NA, NA, NA, …
## $ medium_url                                   <lgl> NA, NA, NA, NA, NA, NA, …
## $ picture_url                                  <chr> "https://a0.muscache.com…
## $ xl_picture_url                               <lgl> NA, NA, NA, NA, NA, NA, …
## $ host_id                                      <dbl> 16774, 109777, 112210, 1…
## $ host_url                                     <chr> "https://www.airbnb.com/…
## $ host_name                                    <chr> "Simon", "Kari", "Oliver…
## $ host_since                                   <date> 2009-05-12, 2010-04-17,…
## $ host_location                                <chr> "Copenhagen, Capital Reg…
## $ host_about                                   <chr> "I'm currently working a…
## $ host_response_time                           <chr> "N/A", "N/A", "within a …
## $ host_response_rate                           <chr> "N/A", "N/A", "100%", "N…
## $ host_acceptance_rate                         <chr> "33%", "19%", "100%", "1…
## $ host_is_superhost                            <lgl> FALSE, FALSE, FALSE, FAL…
## $ host_thumbnail_url                           <chr> "https://a0.muscache.com…
## $ host_picture_url                             <chr> "https://a0.muscache.com…
## $ host_neighbourhood                           <chr> "Nørrebro", "Indre By", …
## $ host_listings_count                          <dbl> 1, 1, 4, 1, 1, 1, 3, 1, …
## $ host_total_listings_count                    <dbl> 1, 1, 4, 1, 1, 1, 3, 1, …
## $ host_verifications                           <chr> "['email', 'phone', 'rev…
## $ host_has_profile_pic                         <lgl> TRUE, TRUE, TRUE, TRUE, …
## $ host_identity_verified                       <lgl> FALSE, FALSE, TRUE, FALS…
## $ street                                       <chr> "Copenhagen, Hovedstaden…
## $ neighbourhood                                <chr> "Nørrebro", "Indre By", …
## $ neighbourhood_cleansed                       <chr> "Nrrebro", "Indre By", "…
## $ neighbourhood_group_cleansed                 <lgl> NA, NA, NA, NA, NA, NA, …
## $ city                                         <chr> "Copenhagen", "Copenhage…
## $ state                                        <chr> "Hovedstaden", "Hovedsta…
## $ zipcode                                      <dbl> 2200, 2100, 1210, 1650, …
## $ market                                       <chr> "Copenhagen", "Copenhage…
## $ smart_location                               <chr> "Copenhagen, Denmark", "…
## $ country_code                                 <chr> "DK", "DK", "DK", "DK", …
## $ country                                      <chr> "Denmark", "Denmark", "D…
## $ latitude                                     <dbl> 55.68798, 55.69163, 55.6…
## $ longitude                                    <dbl> 12.54571, 12.57459, 12.5…
## $ is_location_exact                            <lgl> TRUE, TRUE, TRUE, TRUE, …
## $ property_type                                <chr> "Apartment", "House", "H…
## $ room_type                                    <chr> "Private room", "Entire …
## $ accommodates                                 <dbl> 2, 6, 12, 2, 4, 3, 3, 4,…
## $ bathrooms                                    <dbl> 1.0, 1.5, 2.5, 1.0, 1.0,…
## $ bedrooms                                     <dbl> 1, 4, 6, 1, 3, 1, 1, 2, …
## $ beds                                         <dbl> 1, 4, 7, 1, 3, 3, 2, 2, …
## $ bed_type                                     <chr> "Real Bed", "Real Bed", …
## $ amenities                                    <chr> "{TV,\"Cable TV\",Wifi,K…
## $ square_feet                                  <dbl> 97, NA, NA, NA, NA, 689,…
## $ price                                        <chr> "$365.00", "$2,398.00", …
## $ weekly_price                                 <chr> NA, NA, "$17,513.00", NA…
## $ monthly_price                                <chr> NA, NA, "$67,073.00", NA…
## $ security_deposit                             <chr> "$0.00", "$5,000.00", "$…
## $ cleaning_fee                                 <chr> "$33.00", "$1,100.00", "…
## $ guests_included                              <dbl> 1, 3, 1, 1, 1, 2, 2, 2, …
## $ extra_people                                 <chr> "$66.00", "$350.00", "$0…
## $ minimum_nights                               <dbl> 2, 3, 3, 7, 7, 2, 3, 6, …
## $ maximum_nights                               <dbl> 15, 30, 31, 14, 31, 10, …
## $ minimum_minimum_nights                       <dbl> 2, 3, 3, 3, 7, 2, 3, 6, …
## $ maximum_minimum_nights                       <dbl> 2, 3, 3, 5, 7, 2, 3, 6, …
## $ minimum_maximum_nights                       <dbl> 15, 30, 1125, 14, 1125, …
## $ maximum_maximum_nights                       <dbl> 15, 30, 1125, 14, 1125, …
## $ minimum_nights_avg_ntm                       <dbl> 2.0, 3.0, 3.0, 4.1, 7.0,…
## $ maximum_nights_avg_ntm                       <dbl> 15, 30, 1125, 14, 1125, …
## $ calendar_updated                             <chr> "5 months ago", "4 month…
## $ has_availability                             <lgl> TRUE, TRUE, TRUE, TRUE, …
## $ availability_30                              <dbl> 29, 28, 29, 21, 0, 0, 8,…
## $ availability_60                              <dbl> 59, 58, 59, 21, 0, 0, 8,…
## $ availability_90                              <dbl> 89, 88, 89, 21, 0, 0, 8,…
## $ availability_365                             <dbl> 89, 363, 172, 21, 0, 58,…
## $ calendar_last_scraped                        <date> 2020-06-28, 2020-06-28,…
## $ number_of_reviews                            <dbl> 168, 50, 293, 22, 90, 17…
## $ number_of_reviews_ltm                        <dbl> 1, 4, 31, 2, 0, 0, 1, 0,…
## $ first_review                                 <date> 2009-09-04, 2013-12-02,…
## $ last_review                                  <date> 2019-07-19, 2019-12-14,…
## $ review_scores_rating                         <dbl> 96, 98, 91, 98, 94, 97, …
## $ review_scores_accuracy                       <dbl> 10, 10, 10, 10, 10, 10, …
## $ review_scores_cleanliness                    <dbl> 9, 10, 9, 10, 9, 10, 10,…
## $ review_scores_checkin                        <dbl> 10, 10, 10, 10, 10, 10, …
## $ review_scores_communication                  <dbl> 10, 10, 10, 10, 9, 10, 1…
## $ review_scores_location                       <dbl> 9, 10, 10, 10, 10, 10, 1…
## $ review_scores_value                          <dbl> 9, 10, 9, 10, 9, 9, 9, 9…
## $ requires_license                             <lgl> FALSE, FALSE, FALSE, FAL…
## $ license                                      <lgl> NA, NA, NA, NA, NA, NA, …
## $ jurisdiction_names                           <lgl> NA, NA, NA, NA, NA, NA, …
## $ instant_bookable                             <lgl> FALSE, FALSE, FALSE, FAL…
## $ is_business_travel_ready                     <lgl> FALSE, FALSE, FALSE, FAL…
## $ cancellation_policy                          <chr> "moderate", "moderate", …
## $ require_guest_profile_picture                <lgl> FALSE, FALSE, FALSE, FAL…
## $ require_guest_phone_verification             <lgl> FALSE, FALSE, FALSE, FAL…
## $ calculated_host_listings_count               <dbl> 1, 1, 1, 1, 1, 1, 1, 1, …
## $ calculated_host_listings_count_entire_homes  <dbl> 0, 1, 1, 1, 1, 1, 1, 1, …
## $ calculated_host_listings_count_private_rooms <dbl> 1, 0, 0, 0, 0, 0, 0, 0, …
## $ calculated_host_listings_count_shared_rooms  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ reviews_per_month                            <dbl> 1.28, 0.62, 2.48, 0.18, …
Table 1: Data summary
Name listings
Number of rows 28523
Number of columns 106
_______________________
Column type frequency:
character 45
Date 5
logical 16
numeric 40
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
listing_url 0 1.00 33 37 0 28523 0
name 57 1.00 1 211 0 26907 0
summary 1096 0.96 1 1000 0 26986 0
space 11390 0.60 1 1000 0 16806 0
description 515 0.98 1 1000 0 27748 0
experiences_offered 0 1.00 4 4 0 1 0
neighborhood_overview 12407 0.57 1 1000 0 15520 0
notes 20832 0.27 1 1000 0 7366 0
transit 11316 0.60 1 1000 0 16714 0
access 15751 0.45 1 1000 0 11251 0
interaction 14111 0.51 1 1000 0 13728 0
house_rules 13332 0.53 1 1000 0 13703 0
picture_url 0 1.00 81 146 0 28282 0
host_url 0 1.00 37 43 0 25745 0
host_name 12 1.00 1 34 0 6415 0
host_location 93 1.00 2 152 0 863 0
host_about 14028 0.51 1 3550 0 12478 34
host_response_time 11 1.00 3 18 0 5 0
host_response_rate 11 1.00 2 4 0 44 0
host_acceptance_rate 11 1.00 2 4 0 100 0
host_thumbnail_url 11 1.00 55 106 0 25669 0
host_picture_url 11 1.00 57 109 0 25669 0
host_neighbourhood 8007 0.72 1 21 0 56 0
host_verifications 0 1.00 2 156 0 316 0
street 0 1.00 10 61 0 528 0
neighbourhood 0 1.00 5 14 0 21 0
neighbourhood_cleansed 0 1.00 5 25 0 11 0
city 13 1.00 1 26 0 136 0
state 24042 0.16 1 25 0 170 0
market 864 0.97 6 21 0 10 0
smart_location 0 1.00 10 35 0 158 0
country_code 0 1.00 2 2 0 1 0
country 0 1.00 7 7 0 1 0
property_type 0 1.00 3 22 0 29 0
room_type 0 1.00 10 15 0 4 0
bed_type 0 1.00 5 13 0 5 0
amenities 0 1.00 2 1179 0 26634 0
price 0 1.00 5 10 0 611 0
weekly_price 25008 0.12 7 11 0 718 0
monthly_price 26971 0.05 7 11 0 476 0
security_deposit 13845 0.51 5 10 0 386 0
cleaning_fee 8968 0.69 5 9 0 430 0
extra_people 0 1.00 5 9 0 265 0
calendar_updated 0 1.00 5 13 0 82 0
cancellation_policy 0 1.00 8 27 0 4 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
last_scraped 0 1.00 2020-06-26 2020-06-30 2020-06-27 5
host_since 11 1.00 2008-06-27 2020-06-26 2015-06-22 3379
calendar_last_scraped 0 1.00 2020-06-26 2020-06-30 2020-06-27 5
first_review 4968 0.83 2009-09-04 2020-06-28 2017-08-07 2582
last_review 4968 0.83 2011-08-10 2020-06-28 2019-07-29 1846

Variable type: logical

skim_variable n_missing complete_rate mean count
thumbnail_url 28523 0 NaN :
medium_url 28523 0 NaN :
xl_picture_url 28523 0 NaN :
host_is_superhost 11 1 0.10 FAL: 25610, TRU: 2902
host_has_profile_pic 11 1 1.00 TRU: 28445, FAL: 67
host_identity_verified 11 1 0.38 FAL: 17790, TRU: 10722
neighbourhood_group_cleansed 28523 0 NaN :
is_location_exact 0 1 0.79 TRU: 22507, FAL: 6016
has_availability 0 1 1.00 TRU: 28523
requires_license 0 1 0.00 FAL: 28523
license 28523 0 NaN :
jurisdiction_names 28523 0 NaN :
instant_bookable 0 1 0.29 FAL: 20360, TRU: 8163
is_business_travel_ready 0 1 0.00 FAL: 28523
require_guest_profile_picture 0 1 0.01 FAL: 28379, TRU: 144
require_guest_phone_verification 0 1 0.01 FAL: 28327, TRU: 196

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 2.044063e+07 12261143.29 6.983000e+03 1.008378e+07 1.950297e+07 3.043587e+07 4.395635e+07 ▇▇▇▅▆
scrape_id 0 1.00 2.020063e+13 0.00 2.020063e+13 2.020063e+13 2.020063e+13 2.020063e+13 2.020063e+13 ▁▁▇▁▁
host_id 0 1.00 6.731400e+07 76172316.22 5.130000e+02 1.281774e+07 3.604205e+07 9.463162e+07 3.519509e+08 ▇▂▁▁▁
host_listings_count 11 1.00 5.460000e+00 34.32 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 7.370000e+02 ▇▁▁▁▁
host_total_listings_count 11 1.00 5.460000e+00 34.32 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 7.370000e+02 ▇▁▁▁▁
zipcode 815 0.97 2.069760e+03 390.81 2.000000e+01 1.855750e+03 2.200000e+03 2.300000e+03 2.100000e+04 ▇▁▁▁▁
latitude 0 1.00 5.568000e+01 0.02 5.562000e+01 5.567000e+01 5.568000e+01 5.570000e+01 5.573000e+01 ▁▃▇▇▁
longitude 0 1.00 1.256000e+01 0.03 1.245000e+01 1.254000e+01 1.255000e+01 1.258000e+01 1.264000e+01 ▁▂▇▅▂
accommodates 0 1.00 3.320000e+00 1.63 1.000000e+00 2.000000e+00 3.000000e+00 4.000000e+00 1.600000e+01 ▇▂▁▁▁
bathrooms 12 1.00 1.080000e+00 0.28 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+01 ▇▁▁▁▁
bedrooms 29 1.00 1.550000e+00 1.06 0.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 1.010000e+02 ▇▁▁▁▁
beds 103 1.00 2.040000e+00 1.44 0.000000e+00 1.000000e+00 2.000000e+00 3.000000e+00 2.500000e+01 ▇▁▁▁▁
square_feet 28130 0.01 7.218900e+02 576.10 0.000000e+00 1.200000e+02 7.640000e+02 1.076000e+03 2.799000e+03 ▇▇▃▁▁
guests_included 0 1.00 1.520000e+00 1.06 1.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 1.600000e+01 ▇▁▁▁▁
minimum_nights 0 1.00 3.850000e+00 18.07 1.000000e+00 2.000000e+00 3.000000e+00 4.000000e+00 1.100000e+03 ▇▁▁▁▁
maximum_nights 0 1.00 6.206700e+02 553.19 1.000000e+00 1.500000e+01 1.125000e+03 1.125000e+03 9.999000e+03 ▇▁▁▁▁
minimum_minimum_nights 0 1.00 3.840000e+00 18.08 1.000000e+00 2.000000e+00 3.000000e+00 4.000000e+00 1.100000e+03 ▇▁▁▁▁
maximum_minimum_nights 0 1.00 4.120000e+00 19.29 1.000000e+00 2.000000e+00 3.000000e+00 4.000000e+00 1.100000e+03 ▇▁▁▁▁
minimum_maximum_nights 0 1.00 6.669200e+02 548.00 1.000000e+00 2.000000e+01 1.125000e+03 1.125000e+03 9.999000e+03 ▇▁▁▁▁
maximum_maximum_nights 0 1.00 6.703800e+02 547.39 1.000000e+00 2.000000e+01 1.125000e+03 1.125000e+03 9.999000e+03 ▇▁▁▁▁
minimum_nights_avg_ntm 0 1.00 3.970000e+00 18.35 1.000000e+00 2.000000e+00 3.000000e+00 4.000000e+00 1.100000e+03 ▇▁▁▁▁
maximum_nights_avg_ntm 0 1.00 6.684100e+02 547.20 1.000000e+00 2.000000e+01 1.125000e+03 1.125000e+03 9.999000e+03 ▇▁▁▁▁
availability_30 0 1.00 5.870000e+00 10.42 0.000000e+00 0.000000e+00 0.000000e+00 8.000000e+00 3.000000e+01 ▇▁▁▁▂
availability_60 0 1.00 1.124000e+01 20.15 0.000000e+00 0.000000e+00 0.000000e+00 1.400000e+01 6.000000e+01 ▇▁▁▁▂
availability_90 0 1.00 1.649000e+01 29.99 0.000000e+00 0.000000e+00 0.000000e+00 1.800000e+01 9.000000e+01 ▇▁▁▁▁
availability_365 0 1.00 4.947000e+01 99.23 0.000000e+00 0.000000e+00 0.000000e+00 3.500000e+01 3.650000e+02 ▇▁▁▁▁
number_of_reviews 0 1.00 1.357000e+01 27.09 0.000000e+00 1.000000e+00 5.000000e+00 1.500000e+01 6.370000e+02 ▇▁▁▁▁
number_of_reviews_ltm 0 1.00 2.740000e+00 6.64 0.000000e+00 0.000000e+00 0.000000e+00 3.000000e+00 3.760000e+02 ▇▁▁▁▁
review_scores_rating 5447 0.81 9.516000e+01 6.83 2.000000e+01 9.300000e+01 9.700000e+01 1.000000e+02 1.000000e+02 ▁▁▁▁▇
review_scores_accuracy 5468 0.81 9.730000e+00 0.66 2.000000e+00 1.000000e+01 1.000000e+01 1.000000e+01 1.000000e+01 ▁▁▁▁▇
review_scores_cleanliness 5466 0.81 9.380000e+00 0.96 2.000000e+00 9.000000e+00 1.000000e+01 1.000000e+01 1.000000e+01 ▁▁▁▁▇
review_scores_checkin 5488 0.81 9.820000e+00 0.56 2.000000e+00 1.000000e+01 1.000000e+01 1.000000e+01 1.000000e+01 ▁▁▁▁▇
review_scores_communication 5470 0.81 9.860000e+00 0.53 2.000000e+00 1.000000e+01 1.000000e+01 1.000000e+01 1.000000e+01 ▁▁▁▁▇
review_scores_location 5491 0.81 9.600000e+00 0.68 2.000000e+00 9.000000e+00 1.000000e+01 1.000000e+01 1.000000e+01 ▁▁▁▁▇
review_scores_value 5495 0.81 9.450000e+00 0.76 2.000000e+00 9.000000e+00 1.000000e+01 1.000000e+01 1.000000e+01 ▁▁▁▁▇
calculated_host_listings_count 0 1.00 4.450000e+00 28.06 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 2.810000e+02 ▇▁▁▁▁
calculated_host_listings_count_entire_homes 0 1.00 4.170000e+00 28.08 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 2.810000e+02 ▇▁▁▁▁
calculated_host_listings_count_private_rooms 0 1.00 2.600000e-01 0.73 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.200000e+01 ▇▁▁▁▁
calculated_host_listings_count_shared_rooms 0 1.00 1.000000e-02 0.33 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.300000e+01 ▇▁▁▁▁
reviews_per_month 4968 0.83 4.900000e-01 0.73 1.000000e-02 1.200000e-01 2.800000e-01 5.900000e-01 3.065000e+01 ▇▁▁▁▁

From this list can be concluded that there are 28,523 values for 106 variables. there are a couple of values that contain NA, 1 of them, cleaning_fee, stands out and is one of the variables that was flagged as important in this analysis. This will be addressed in the next section. Other variables that stand out are price, cleaning_fee and extra_people. these variables are stored as character values and need to have the $-sign removed to be stored as a numeric value. Since these variables will be used further in the analyis, this also will be resolved in the next section where we clean the data and run summary statistics.

Computing summary statistics of the variables of interest

In this section we will clean the data and run summary statistics that will be visualised in the next section. First the values stored as characters are converted to numeric values and the $-sign removed.

# Make price into a numeric variable
listings$price = as.numeric(gsub("[\\$,]", "", listings$price)) 

# Make cleaning fee into a numeric variable
listings$cleaning_fee = as.numeric(gsub("[\\$,]", "", listings$cleaning_fee)) 

# Make extra people a numeric variable
listings$extra_people = as.numeric(gsub("[\\$,]", "", listings$extra_people))

Following this, the NA’s in the cleaning_fee values will be converted to zero. Assuming the NA’s are from hosts not providing an input for cleaning fees, the NA’s should equal zero.

# Change NA cleaning fee values to 0, assuming that 0 is meant for NA
listings <- listings %>%   
  mutate(cleaning_fee = case_when(
    is.na(cleaning_fee) ~ 0, 
    TRUE ~ cleaning_fee
  ))

The property type categorical variables has too many categories to be summarised. To reduce this, the 4 most substantial categories have been specified, the remainder will be classed as other.

# Create a list of property types and the count
kbl(listings %>% 
  group_by(property_type) %>%
  summarise(count= n()) %>%
  arrange(desc(count))) %>%
  kable_styling(bootstrap_options=c("striped","hover","condensed","responsive"))
## `summarise()` ungrouping output (override with `.groups` argument)
property_type count
Apartment 23942
Condominium 1673
House 1365
Townhouse 554
Serviced apartment 331
Loft 231
Villa 190
Hostel 28
Guesthouse 23
Bungalow 22
Guest suite 22
Bed and breakfast 21
Houseboat 21
Boat 20
Hotel 20
Other 14
Tiny house 14
Cabin 10
Boutique hotel 5
Cottage 4
Hut 3
Aparthotel 2
Barn 2
Camper/RV 1
Casa particular (Cuba) 1
Earth house 1
Island 1
Lighthouse 1
Tent 1
# Take top 4 property_type and create a new column and assigning property_type in groups
listings <- listings %>%
  mutate(prop_type_simplified = case_when(
  property_type %in% c("Apartment","Condominium", "House","Townhouse") ~ property_type, 
  TRUE ~ "Other"
  ))

# Check if all the variables are in the correct type
skim(listings) 
(#tab:cleaning data by reducing property type criteria)Data summary
Name listings
Number of rows 28523
Number of columns 107
_______________________
Column type frequency:
character 43
Date 5
logical 16
numeric 43
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
listing_url 0 1.00 33 37 0 28523 0
name 57 1.00 1 211 0 26907 0
summary 1096 0.96 1 1000 0 26986 0
space 11390 0.60 1 1000 0 16806 0
description 515 0.98 1 1000 0 27748 0
experiences_offered 0 1.00 4 4 0 1 0
neighborhood_overview 12407 0.57 1 1000 0 15520 0
notes 20832 0.27 1 1000 0 7366 0
transit 11316 0.60 1 1000 0 16714 0
access 15751 0.45 1 1000 0 11251 0
interaction 14111 0.51 1 1000 0 13728 0
house_rules 13332 0.53 1 1000 0 13703 0
picture_url 0 1.00 81 146 0 28282 0
host_url 0 1.00 37 43 0 25745 0
host_name 12 1.00 1 34 0 6415 0
host_location 93 1.00 2 152 0 863 0
host_about 14028 0.51 1 3550 0 12478 34
host_response_time 11 1.00 3 18 0 5 0
host_response_rate 11 1.00 2 4 0 44 0
host_acceptance_rate 11 1.00 2 4 0 100 0
host_thumbnail_url 11 1.00 55 106 0 25669 0
host_picture_url 11 1.00 57 109 0 25669 0
host_neighbourhood 8007 0.72 1 21 0 56 0
host_verifications 0 1.00 2 156 0 316 0
street 0 1.00 10 61 0 528 0
neighbourhood 0 1.00 5 14 0 21 0
neighbourhood_cleansed 0 1.00 5 25 0 11 0
city 13 1.00 1 26 0 136 0
state 24042 0.16 1 25 0 170 0
market 864 0.97 6 21 0 10 0
smart_location 0 1.00 10 35 0 158 0
country_code 0 1.00 2 2 0 1 0
country 0 1.00 7 7 0 1 0
property_type 0 1.00 3 22 0 29 0
room_type 0 1.00 10 15 0 4 0
bed_type 0 1.00 5 13 0 5 0
amenities 0 1.00 2 1179 0 26634 0
weekly_price 25008 0.12 7 11 0 718 0
monthly_price 26971 0.05 7 11 0 476 0
security_deposit 13845 0.51 5 10 0 386 0
calendar_updated 0 1.00 5 13 0 82 0
cancellation_policy 0 1.00 8 27 0 4 0
prop_type_simplified 0 1.00 5 11 0 5 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
last_scraped 0 1.00 2020-06-26 2020-06-30 2020-06-27 5
host_since 11 1.00 2008-06-27 2020-06-26 2015-06-22 3379
calendar_last_scraped 0 1.00 2020-06-26 2020-06-30 2020-06-27 5
first_review 4968 0.83 2009-09-04 2020-06-28 2017-08-07 2582
last_review 4968 0.83 2011-08-10 2020-06-28 2019-07-29 1846

Variable type: logical

skim_variable n_missing complete_rate mean count
thumbnail_url 28523 0 NaN :
medium_url 28523 0 NaN :
xl_picture_url 28523 0 NaN :
host_is_superhost 11 1 0.10 FAL: 25610, TRU: 2902
host_has_profile_pic 11 1 1.00 TRU: 28445, FAL: 67
host_identity_verified 11 1 0.38 FAL: 17790, TRU: 10722
neighbourhood_group_cleansed 28523 0 NaN :
is_location_exact 0 1 0.79 TRU: 22507, FAL: 6016
has_availability 0 1 1.00 TRU: 28523
requires_license 0 1 0.00 FAL: 28523
license 28523 0 NaN :
jurisdiction_names 28523 0 NaN :
instant_bookable 0 1 0.29 FAL: 20360, TRU: 8163
is_business_travel_ready 0 1 0.00 FAL: 28523
require_guest_profile_picture 0 1 0.01 FAL: 28379, TRU: 144
require_guest_phone_verification 0 1 0.01 FAL: 28327, TRU: 196

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 2.044063e+07 12261143.29 6.983000e+03 1.008378e+07 1.950297e+07 3.043587e+07 4.395635e+07 ▇▇▇▅▆
scrape_id 0 1.00 2.020063e+13 0.00 2.020063e+13 2.020063e+13 2.020063e+13 2.020063e+13 2.020063e+13 ▁▁▇▁▁
host_id 0 1.00 6.731400e+07 76172316.22 5.130000e+02 1.281774e+07 3.604205e+07 9.463162e+07 3.519509e+08 ▇▂▁▁▁
host_listings_count 11 1.00 5.460000e+00 34.32 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 7.370000e+02 ▇▁▁▁▁
host_total_listings_count 11 1.00 5.460000e+00 34.32 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 7.370000e+02 ▇▁▁▁▁
zipcode 815 0.97 2.069760e+03 390.81 2.000000e+01 1.855750e+03 2.200000e+03 2.300000e+03 2.100000e+04 ▇▁▁▁▁
latitude 0 1.00 5.568000e+01 0.02 5.562000e+01 5.567000e+01 5.568000e+01 5.570000e+01 5.573000e+01 ▁▃▇▇▁
longitude 0 1.00 1.256000e+01 0.03 1.245000e+01 1.254000e+01 1.255000e+01 1.258000e+01 1.264000e+01 ▁▂▇▅▂
accommodates 0 1.00 3.320000e+00 1.63 1.000000e+00 2.000000e+00 3.000000e+00 4.000000e+00 1.600000e+01 ▇▂▁▁▁
bathrooms 12 1.00 1.080000e+00 0.28 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+01 ▇▁▁▁▁
bedrooms 29 1.00 1.550000e+00 1.06 0.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 1.010000e+02 ▇▁▁▁▁
beds 103 1.00 2.040000e+00 1.44 0.000000e+00 1.000000e+00 2.000000e+00 3.000000e+00 2.500000e+01 ▇▁▁▁▁
square_feet 28130 0.01 7.218900e+02 576.10 0.000000e+00 1.200000e+02 7.640000e+02 1.076000e+03 2.799000e+03 ▇▇▃▁▁
price 0 1.00 8.491900e+02 1066.81 0.000000e+00 4.980000e+02 6.980000e+02 9.960000e+02 6.917500e+04 ▇▁▁▁▁
cleaning_fee 0 1.00 2.103900e+02 249.69 0.000000e+00 0.000000e+00 1.500000e+02 3.000000e+02 4.000000e+03 ▇▁▁▁▁
guests_included 0 1.00 1.520000e+00 1.06 1.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 1.600000e+01 ▇▁▁▁▁
extra_people 0 1.00 6.388000e+01 126.23 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+02 2.024000e+03 ▇▁▁▁▁
minimum_nights 0 1.00 3.850000e+00 18.07 1.000000e+00 2.000000e+00 3.000000e+00 4.000000e+00 1.100000e+03 ▇▁▁▁▁
maximum_nights 0 1.00 6.206700e+02 553.19 1.000000e+00 1.500000e+01 1.125000e+03 1.125000e+03 9.999000e+03 ▇▁▁▁▁
minimum_minimum_nights 0 1.00 3.840000e+00 18.08 1.000000e+00 2.000000e+00 3.000000e+00 4.000000e+00 1.100000e+03 ▇▁▁▁▁
maximum_minimum_nights 0 1.00 4.120000e+00 19.29 1.000000e+00 2.000000e+00 3.000000e+00 4.000000e+00 1.100000e+03 ▇▁▁▁▁
minimum_maximum_nights 0 1.00 6.669200e+02 548.00 1.000000e+00 2.000000e+01 1.125000e+03 1.125000e+03 9.999000e+03 ▇▁▁▁▁
maximum_maximum_nights 0 1.00 6.703800e+02 547.39 1.000000e+00 2.000000e+01 1.125000e+03 1.125000e+03 9.999000e+03 ▇▁▁▁▁
minimum_nights_avg_ntm 0 1.00 3.970000e+00 18.35 1.000000e+00 2.000000e+00 3.000000e+00 4.000000e+00 1.100000e+03 ▇▁▁▁▁
maximum_nights_avg_ntm 0 1.00 6.684100e+02 547.20 1.000000e+00 2.000000e+01 1.125000e+03 1.125000e+03 9.999000e+03 ▇▁▁▁▁
availability_30 0 1.00 5.870000e+00 10.42 0.000000e+00 0.000000e+00 0.000000e+00 8.000000e+00 3.000000e+01 ▇▁▁▁▂
availability_60 0 1.00 1.124000e+01 20.15 0.000000e+00 0.000000e+00 0.000000e+00 1.400000e+01 6.000000e+01 ▇▁▁▁▂
availability_90 0 1.00 1.649000e+01 29.99 0.000000e+00 0.000000e+00 0.000000e+00 1.800000e+01 9.000000e+01 ▇▁▁▁▁
availability_365 0 1.00 4.947000e+01 99.23 0.000000e+00 0.000000e+00 0.000000e+00 3.500000e+01 3.650000e+02 ▇▁▁▁▁
number_of_reviews 0 1.00 1.357000e+01 27.09 0.000000e+00 1.000000e+00 5.000000e+00 1.500000e+01 6.370000e+02 ▇▁▁▁▁
number_of_reviews_ltm 0 1.00 2.740000e+00 6.64 0.000000e+00 0.000000e+00 0.000000e+00 3.000000e+00 3.760000e+02 ▇▁▁▁▁
review_scores_rating 5447 0.81 9.516000e+01 6.83 2.000000e+01 9.300000e+01 9.700000e+01 1.000000e+02 1.000000e+02 ▁▁▁▁▇
review_scores_accuracy 5468 0.81 9.730000e+00 0.66 2.000000e+00 1.000000e+01 1.000000e+01 1.000000e+01 1.000000e+01 ▁▁▁▁▇
review_scores_cleanliness 5466 0.81 9.380000e+00 0.96 2.000000e+00 9.000000e+00 1.000000e+01 1.000000e+01 1.000000e+01 ▁▁▁▁▇
review_scores_checkin 5488 0.81 9.820000e+00 0.56 2.000000e+00 1.000000e+01 1.000000e+01 1.000000e+01 1.000000e+01 ▁▁▁▁▇
review_scores_communication 5470 0.81 9.860000e+00 0.53 2.000000e+00 1.000000e+01 1.000000e+01 1.000000e+01 1.000000e+01 ▁▁▁▁▇
review_scores_location 5491 0.81 9.600000e+00 0.68 2.000000e+00 9.000000e+00 1.000000e+01 1.000000e+01 1.000000e+01 ▁▁▁▁▇
review_scores_value 5495 0.81 9.450000e+00 0.76 2.000000e+00 9.000000e+00 1.000000e+01 1.000000e+01 1.000000e+01 ▁▁▁▁▇
calculated_host_listings_count 0 1.00 4.450000e+00 28.06 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 2.810000e+02 ▇▁▁▁▁
calculated_host_listings_count_entire_homes 0 1.00 4.170000e+00 28.08 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 2.810000e+02 ▇▁▁▁▁
calculated_host_listings_count_private_rooms 0 1.00 2.600000e-01 0.73 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.200000e+01 ▇▁▁▁▁
calculated_host_listings_count_shared_rooms 0 1.00 1.000000e-02 0.33 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.300000e+01 ▇▁▁▁▁
reviews_per_month 4968 0.83 4.900000e-01 0.73 1.000000e-02 1.200000e-01 2.800000e-01 5.900000e-01 3.065000e+01 ▇▁▁▁▁

To only have a list of variables that will be used in the analysis, redundant variables are filtered out. As the analysis focuses on short holiday stays, the long stay listings need to be removed from the list. By looking at the minimum nights on the listings, it can be determined if these addresses are viable for short stays.

minimum_nights count
2 8500
3 6479
1 5444
4 3117
5 2346
7 997
6 702
14 198
10 170
30 136
8 52
20 49
15 35
21 33
60 23
9 22
12 21
90 21
13 20
25 16
28 14
50 14
100 8
31 7
11 6
23 6
29 6
45 5
16 4
19 4
200 4
1000 4
17 3
40 3
70 3
80 3
120 3
18 2
22 2
24 2
26 2
27 2
49 2
85 2
160 2
180 2
360 2
365 2
999 2
34 1
35 1
39 1
42 1
43 1
48 1
56 1
59 1
61 1
66 1
75 1
89 1
92 1
102 1
107 1
150 1
270 1
300 1
430 1
500 1
1100 1

In this summary it can be seen that after 5 minimum days the numbers of listing drop significantly.The majority of listings are within the 1 to 4 minimum days stay. Therefore this will give a good indication of the listings intended for short, holiday, stays. In the next section the data is filtered to only feature a minimum of 4 nights or less. Also, the variables that will be used in the analysis are selected, so to only have relevant variables left to make the analysis more concise and clear.

Let’s have a look at the data.

# Check if all the variables are in the correct type
skim(listings_clean) 
(#tab:skim the new list)Data summary
Name listings_clean
Number of rows 28523
Number of columns 20
_______________________
Column type frequency:
character 5
logical 2
numeric 13
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
listing_url 0 1 33 37 0 28523 0
prop_type_simplified 0 1 5 11 0 5 0
room_type 0 1 10 15 0 4 0
neighbourhood_cleansed 0 1 5 25 0 11 0
cancellation_policy 0 1 8 27 0 4 0

Variable type: logical

skim_variable n_missing complete_rate mean count
host_is_superhost 11 1 0.10 FAL: 25610, TRU: 2902
is_location_exact 0 1 0.79 TRU: 22507, FAL: 6016

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 20440633.59 12261143.29 6983.00 10083785.00 19502966.00 30435871.00 43956346.00 ▇▇▇▅▆
price 0 1.00 849.19 1066.81 0.00 498.00 698.00 996.00 69175.00 ▇▁▁▁▁
guests_included 0 1.00 1.52 1.06 1.00 1.00 1.00 2.00 16.00 ▇▁▁▁▁
cleaning_fee 0 1.00 210.39 249.69 0.00 0.00 150.00 300.00 4000.00 ▇▁▁▁▁
extra_people 0 1.00 63.88 126.23 0.00 0.00 0.00 100.00 2024.00 ▇▁▁▁▁
number_of_reviews 0 1.00 13.57 27.09 0.00 1.00 5.00 15.00 637.00 ▇▁▁▁▁
review_scores_rating 5447 0.81 95.16 6.83 20.00 93.00 97.00 100.00 100.00 ▁▁▁▁▇
bathrooms 12 1.00 1.08 0.28 0.00 1.00 1.00 1.00 10.00 ▇▁▁▁▁
bedrooms 29 1.00 1.55 1.06 0.00 1.00 1.00 2.00 101.00 ▇▁▁▁▁
beds 103 1.00 2.04 1.44 0.00 1.00 2.00 3.00 25.00 ▇▁▁▁▁
accommodates 0 1.00 3.32 1.63 1.00 2.00 3.00 4.00 16.00 ▇▂▁▁▁
longitude 0 1.00 12.56 0.03 12.45 12.54 12.55 12.58 12.64 ▁▂▇▅▂
latitude 0 1.00 55.68 0.02 55.62 55.67 55.68 55.70 55.73 ▁▃▇▇▁

Visualisations

Now that the data is clean, a selection of useful data has been made, and filtered to only feature our criteria for a short stay holiday listing, visualisations of data can be made. This lays the foundation of any correlations and questions the regression model will have to explain and answer.

location

In order to get an idea on the concentration of apartment listings in the area of Copenhagen, we plotted a map showing the number of flat available for rent for 4 nights or less. We used clusters instead of point to make the map more readable and to get a better idea of the number of listings in each area.

# Take the filtered listings with a min of 4 nights and create a map of the apartments in Copenhagen
listings_filtered_4nights %>% 
  leaflet() %>% 
  addProviderTiles("OpenStreetMap.Mapnik") %>% 
  addCircleMarkers(lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1,
                   fillOpacity = 0.4, 
                   popup = ~listing_url,
                   label = ~property_type,
                   #To get a better overview of the number and location of apartments in Copenhagen, a clustered map was chosen instead of the original point mapping
                   clusterOptions = markerClusterOptions()
                   )

The map shows the number of available flats are located in the city center. The more we get away from the center the fewer the number of flats available.

To gauge the price difference in each area in Copenhagen, we plotted a heatmap showing the more expensive areas in a brighter red colour. Also we plotted a bar chart to get an idea of the median price for each area. The reason for using median over mean is because the listings in Copenhagen have very extreme values and they skew the mean too much for certain areas.

#maps price to seven colors using quantiles
qpal <- colorQuantile("Reds", listings_filtered_4nights$price, n = 7)

listings_filtered_4nights %>% 
leaflet() %>% 
  addProviderTiles("OpenStreetMap.Mapnik") %>% 
  addCircleMarkers(lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1,
                   fillOpacity = 0.4, 
                   popup = ~listing_url,
                   label = ~property_type,
                   color = ~qpal(price))

We can conclude that the harbour area in Copehagen has the most expensive listings. Therefore we suspect that location will have a high coefficient when we will be running te regression models.

#Calculated median price per night for each neighborhood
median_per_neighborhood <- listings_filtered_4nights %>% 
  group_by(neighbourhood_cleansed) %>% 
  summarise(median_price = median(price))
#Price per night for each neighborhood
median_per_neighborhood %>% 
  ggplot(aes(
    x = reorder(neighbourhood_cleansed, median_price),
    y = median_price
  )) +
    geom_col(fill = "#FF5A5F") +
  scale_fill_manual(values="#FF5A5F")+
  labs(
    title = "Median price per night per neighborhood",
    x = "",
    y = "Median price per neighborhood",
    caption = "Source: Airbnb"
  ) +
  coord_flip() +
  theme_classic() +
  NULL

The Indre By area in Copenhagen is the central harbour area and is the most expensive area for AirBnB prices. This confirms that the neighborhood variable will have a significant impact on the price variable.

accommodation type

Next up, we have plotted a number of bar charts to display the price relationship with the types of accommodation.

Median price per night per property type: The histogram showed that among all the property types, the lowest median price per night is allocated to the appartments

Median price per room type: The histogram showed that hotel rooms are generally more expensive than the other types of rooms available.

# Calculate median price per night for each room type
median_per_proptype <- listings_filtered_4nights %>% 
  group_by(prop_type_simplified) %>% 
  summarise(median_price = median(price))

# Plot a bar chart with median price per night for each property type
median_per_proptype %>% 
  ggplot(aes(
    x = reorder(prop_type_simplified, desc(median_price)),
    y = median_price
  )) +
  geom_col(fill = "#FF5A5F") +
  scale_fill_manual(values="#FF5A5F")+
  labs(
    title = "Median price per night per property type",
    x = "",
    y = "Median price per night",
    caption = "Source: AirBnB"
  ) +
  NULL

# Calculate median price per night for each room type
median_per_roomtype <- listings_filtered_4nights %>% 
  group_by(room_type) %>% 
  summarise(median_price = median(price))

# Price per night for each room type
median_per_roomtype %>% 
  ggplot(aes(
    x = reorder(room_type, desc(median_price)),
    y = median_price
  )) +
  geom_col(fill = "#FF5A5F") +
  scale_fill_manual(values="#FF5A5F")+
  labs(
    title = "Median price per night per room type",
    x = "",
    y = "Median price per night",
    caption = "Source: AirBnB"
  ) +
  NULL

From this we can conclude that both room type and property type have an influence on the price.

rating

Following from this, we wanted to know what influence rating has on the price. For this we used both data on the review scores and whether the listing is posted by a superhost.

#Plotted a density graph to display the relationship with price in being a superhost
listings_clean %>%
  ggplot(aes( x = log(price),
  #colour the graph by being a superhost or not and filtering out the NA values
  colour = !is.na(host_is_superhost))) + 
  geom_density() +    
  labs(      
      title = "Density plot for prices on whether host is superhost",
      x = "price per night (log)",
      y = "density",
      caption = "Source: AirBnB",
      color = "(Non)superhost\n"   ) + 
      theme_classic() + 
  NULL
## Warning: Removed 2 rows containing non-finite values (stat_density).

#Plotted a histogram to display the distribution of ratings and to show if this is affected by being a superhost

listings_clean %>%
  #Filtered out the NA's in superhost
  filter(!is.na(host_is_superhost)) %>% 
  ggplot(aes(
    x = review_scores_rating,
    fill = host_is_superhost
  )) +
  # set the binwidth to 1 to match the scoring bins in the original data
  geom_histogram(binwidth = 1) +
      labs(
      title = "Histogram for review score distribution with (non)superhosts",
      x = "Review scores",
      y = "Count",
      caption = "Source: AirBnB"
    ) +
  #change name of legend
  scale_fill_discrete(
                      name = NULL,
                      labels = c("Non-superhost", "Superhost")
                      ) +
  theme_classic() +
  NULL
## Warning: Removed 5442 rows containing non-finite values (stat_bin).

#plotted a scatter graph showing the relationship between review score and price
listings_clean %>%
  ggplot(aes(
    x = review_scores_rating,
    y = price
  )) +
  geom_point() +
  #Applied a logscale to make the relationship more visible
  scale_y_log10() +
  #Applied a linear line to display any relationship in with price with a higher review score
  geom_smooth(method = lm) +
      labs(
      title = "Scatter plot showing relationship between review scores and price",
      x = "Review scores",
      y = "Price per night (log)",
      caption = "Source: Airbnb"
    ) + 
  theme_classic() +
  NULL
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 5448 rows containing non-finite values (stat_smooth).
## Warning: Removed 5447 rows containing missing values (geom_point).

From this we can conclude that being a superhost does have an influence on the price. People would prefer a superhost, so it seems logical that the average price for a superhost is higher than a listing under a normal host. The ratings however don’t seem to be related to being a superhost. In fact the perfect scores are more among non-superhosts. This could be because of the higher expectations set by booking via a superhost, this could make guests more critical when writing a review. finally, there does not seem to be a strong correlation with a rating score and the price. Hosts will not alter their prices to their review scores.

GGpairs

To display a comprehensive overview of the correlation between variables and the price we plotted a GGpairs plot. Using all the selected variables.

# Display the correlation between four important variables

listings_clean %>%
  select(price, review_scores_rating, bedrooms, bathrooms, accommodates) %>% 
  ggpairs(cardinality_threshold=NULL)
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 5447 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 29 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 12 rows containing missing values
## Warning: Removed 5447 rows containing missing values (geom_point).
## Warning: Removed 5447 rows containing non-finite values (stat_density).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 5458 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 5454 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 5447 rows containing missing values
## Warning: Removed 29 rows containing missing values (geom_point).
## Warning: Removed 5458 rows containing missing values (geom_point).
## Warning: Removed 29 rows containing non-finite values (stat_density).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 38 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 29 rows containing missing values
## Warning: Removed 12 rows containing missing values (geom_point).
## Warning: Removed 5454 rows containing missing values (geom_point).
## Warning: Removed 38 rows containing missing values (geom_point).
## Warning: Removed 12 rows containing non-finite values (stat_density).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 12 rows containing missing values
## Warning: Removed 5447 rows containing missing values (geom_point).
## Warning: Removed 29 rows containing missing values (geom_point).
## Warning: Removed 12 rows containing missing values (geom_point).

REGRESSION

Basic explanatory variables

Before we start with the regression modeling, we need to filter the data to represent 2 people staying for 4 nights, including the cleaning fee.

# Create new variable for 4 nights using price, guests_included, cleaning_fee and extra_people
listings_new <- listings_clean %>%
  mutate(price_4_nights = ifelse(
    guests_included <= 1,
    (price + extra_people) * 4 + cleaning_fee,
    (price) * 4 + cleaning_fee
  ))

The following plots show the density for the total price for 4 nights wjth and without a log function.

# Density plots for price_4_nights
density.default(listings_new$price_4_nights)
## 
## Call:
##  density.default(x = listings_new$price_4_nights)
## 
## Data: listings_new$price_4_nights (28523 obs.);  Bandwidth 'bw' = 173
## 
##        x                  y            
##  Min.   :  -518.9   Min.   :0.000e+00  
##  1st Qu.: 68915.5   1st Qu.:0.000e+00  
##  Median :138350.0   Median :0.000e+00  
##  Mean   :138350.0   Mean   :4.589e-06  
##  3rd Qu.:207784.5   3rd Qu.:2.000e-10  
##  Max.   :277218.9   Max.   :3.565e-04
# Calculate the most frequently occurring price (mode) for 4 nights 
max_p4n <- density(listings_new$price_4_nights)$x[which.max(density(listings_new$price_4_nights)$y)]
max_p4n
## [1] 2742.17
# Plot density of price_4_nights
ggplot(
  listings_new,
  aes(x = price_4_nights)) +
  geom_density() +
  # Plot vertical line to show maximum value
  geom_vline(xintercept =  max_p4n) +
  theme_classic() +
  NULL

# Plot density of log(price_4_nights)
ggplot(listings_new, aes(x = log(price_4_nights))) +
  geom_density() +
  theme_classic() +
  NULL
## Warning: Removed 1 rows containing non-finite values (stat_density).

we should use variable log(price_4_nights). Looking at the density plots of price_4_nights and log(price_4_nights), we can see clearly that log(price_4_nights) is more close to a normal distribution, while most values in price_4_nights are cramped within a very small range.

Model 1

Fitting a regression model called model1 with the following explanatory variables: prop_type_simplified, number_of_reviews, and review_scores_rating

# Create a new dataset with the new variable of log(price_4_nights)
listings_log <- listings_new %>%
  mutate(price_4_nights_log = log10(price_4_nights)) 
  
# Create model1 of price_4_nights with explanatory variables prop_type_simplified, number_of_reviews and review_scores_ratings
model1 <- lm(price_4_nights_log ~
             prop_type_simplified +
             number_of_reviews +
             review_scores_rating,
             data = listings_log
             )

model1 %>% tidy(conf.int=TRUE)
termestimatestd.errorstatisticp.valueconf.lowconf.high
(Intercept)3.24    0.0198  164   0       3.21    3.28    
prop_type_simplifiedCondominium0.033   0.00576 5.749.66e-090.0218  0.0443  
prop_type_simplifiedHouse0.119   0.00693 17.1 4.35e-650.105   0.132   
prop_type_simplifiedOther0.178   0.00872 20.4 1.7e-91 0.161   0.195   
prop_type_simplifiedTownhouse0.177   0.0102  17.4 2.2e-67 0.157   0.197   
number_of_reviews0.0001884.84e-053.890.0001019.34e-050.000283
review_scores_rating0.00235 0.00020711.3 9.89e-300.00194 0.00276 
model1 %>% glance()
r.squaredadj.r.squaredsigmastatisticp.valuedflogLikAICBICdeviancedf.residualnobs
0.0460.04570.2151853.13e-23162.77e+03-5.53e+03-5.46e+031.06e+032306923076
# Check residuals
autoplot(model1)
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

Ratings (review_scores_rating) is a significant predictor. Controlling for other variables, every 1 point increase in rating is associated with $ 0.002 increase in log10(price_4_nights).

Interpretation of prop_type_simplified predictor:

Property type (prop_type_simplified) is a significant predictor. Controlling for other variables, switching property type from apartment to condomunium would result in 0.034 increase in log10(price_4_nights). Similarly, switching property type from apartment to house would result in 0.11 increase in log10(price_4_nights); switching property type from apartment to townhouse would result in 0.17 increase in log10(price_4_nights); switching property type from apartment to property types other than the aforementioned 3 types would result in an average of 0.18 increase in log10(price_4_nights).

Model 1: 4%

Model 2

We want to determine if room_type is a significant predictor of the cost for 4 nights, given everything else in the model.

# Create model2 by adding room_type to model1
model2 <- lm(price_4_nights_log ~
             prop_type_simplified +
             number_of_reviews +
             review_scores_rating +
             room_type, 
             data = listings_log
             )

model2 %>% tidy(conf.int=TRUE)
termestimatestd.errorstatisticp.valueconf.lowconf.high
(Intercept)3.36    0.0177  189   0       3.32    3.39    
prop_type_simplifiedCondominium0.0168  0.00513 3.270.00107 0.00674 0.0269  
prop_type_simplifiedHouse0.126   0.00618 20.4 1.8e-91 0.114   0.138   
prop_type_simplifiedOther0.168   0.00797 21.1 1.14e-970.152   0.184   
prop_type_simplifiedTownhouse0.166   0.00905 18.3 1.98e-740.148   0.184   
number_of_reviews0.0006684.36e-0515.3 1.33e-520.0005820.000753
review_scores_rating0.00154 0.0001858.338.6e-17 0.00118 0.00191 
room_typeHotel room0.112   0.0342  3.280.00104 0.0452  0.179   
room_typePrivate room-0.271   0.00355 -76.3 0       -0.278   -0.264   
room_typeShared room-0.362   0.0246  -14.7 5.51e-49-0.41    -0.314   
model2 %>% glance()
r.squaredadj.r.squaredsigmastatisticp.valuedflogLikAICBICdeviancedf.residualnobs
0.2430.2430.191823095.44e+03-1.09e+04-1.08e+048432306623076
# Check residuals
autoplot(model2)

Interpretation of room_type predictor:

The result of model2 regression shows that rooms_type is a significant predictor. Controlling for other variables, switching room type from entire house to hotel room would result in 0.11 increase in log10(price_4_nights). Nevertheless, switching from entire house to private room would result in 0.27 drop in log10(price_4_nights); switching from entire house to shared room would result in 0.38 drop in log10(price_4_nights).

Model 2: 24%

Exploring additional explanatory variables

Model 3

Are the number of bathrooms, bedrooms, beds, or size of the house (accomodates) significant predictors of price_4_nights?

# Create model3a by adding bathrooms, bedrooms, beds, accommodates to model2 
model3a<-lm(price_4_nights_log ~
            prop_type_simplified +
             number_of_reviews +
             review_scores_rating +
             room_type + 
             bathrooms +
             bedrooms +
             beds + 
             accommodates, 
            data = listings_log)

model3a %>% 
  tidy(conf.int=TRUE)
termestimatestd.errorstatisticp.valueconf.lowconf.high
(Intercept)3.08    0.0163  189    0        3.05    3.12    
prop_type_simplifiedCondominium0.0105  0.00452 2.31 0.0207   0.0016  0.0193  
prop_type_simplifiedHouse-0.0476  0.00594 -8.02 1.08e-15 -0.0593  -0.036   
prop_type_simplifiedOther0.0906  0.00709 12.8  2.97e-37 0.0767  0.105   
prop_type_simplifiedTownhouse0.00279 0.00835 0.3340.738    -0.0136  0.0191  
number_of_reviews0.0005963.85e-0515.5  7.78e-54 0.0005210.000672
review_scores_rating0.00166 0.00016410.2  3.24e-24 0.00134 0.00199 
room_typeHotel room0.223   0.0302  7.38 1.65e-13 0.164   0.282   
room_typePrivate room-0.18    0.00338 -53.2  0        -0.187   -0.174   
room_typeShared room-0.261   0.0217  -12    3e-33        -0.303   -0.218   
bathrooms0.0581  0.0047  12.4  4.54e-35 0.0489  0.0673  
bedrooms0.0501  0.00224 22.3  1.83e-1090.0457  0.0545  
beds-0.00273 0.00125 -2.19 0.0283   -0.00517 -0.00029 
accommodates0.0389  0.00131 29.6  2.69e-1890.0363  0.0415  
model3a %>%
  glance()
r.squaredadj.r.squaredsigmastatisticp.valuedflogLikAICBICdeviancedf.residualnobs
0.4120.4120.1681.24e+030138.36e+03-1.67e+04-1.66e+046532302123035
# Test VIF
car::vif(model3a)
##                          GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.373951  4        1.040510
## number_of_reviews    1.029840  1        1.014810
## review_scores_rating 1.012554  1        1.006257
## room_type            1.272593  3        1.040994
## bathrooms            1.332643  1        1.154401
## bedrooms             3.071499  1        1.752569
## beds                 2.550304  1        1.596967
## accommodates         3.628682  1        1.904910
# Check residuals
autoplot(model3a)

The number of bathrooms, bedrooms and size of the house (accommodates) are significant predictors of log10(price_4_nights) and are all positively related with price_4_nights, while number of beds is not a significant predictor. Coefficients show that each additional bathroom is associated with 0.06 increase in log10(price_4_nights); each additional bedroom is associated with 0.05 increase in log10(price_4_nights). There might be a problem with multi-collinearity since intuitively the number of bathrooms and bedrooms should have a positive relationship with size of the house. A further analysis on VIF shows that size of the house has slight correlation with other predictors but is within an acceptable range.

# Create model3 by removing beds from model3a 
model3<-lm(price_4_nights_log ~
           prop_type_simplified +
           number_of_reviews +
           review_scores_rating +
           room_type + 
           bathrooms +
           bedrooms +
           accommodates, 
           data = listings_log
           )

model3 %>% tidy(conf.int=TRUE)
termestimatestd.errorstatisticp.valueconf.lowconf.high
(Intercept)3.08    0.0163  189    0        3.05    3.12    
prop_type_simplifiedCondominium0.0106  0.00452 2.34 0.0195   0.0017  0.0194  
prop_type_simplifiedHouse-0.0481  0.00593 -8.11 5.24e-16 -0.0598  -0.0365  
prop_type_simplifiedOther0.0905  0.00709 12.7  4.17e-37 0.0766  0.104   
prop_type_simplifiedTownhouse0.00227 0.00834 0.2730.785    -0.0141  0.0186  
number_of_reviews0.0005983.85e-0515.5  3.5e-54  0.0005230.000674
review_scores_rating0.00167 0.00016410.2  1.7e-24  0.00135 0.00199 
room_typeHotel room0.223   0.0302  7.39 1.58e-13 0.164   0.282   
room_typePrivate room-0.181   0.00338 -53.5  0        -0.187   -0.174   
room_typeShared room-0.262   0.0217  -12.1  1.91e-33 -0.304   -0.219   
bathrooms0.0581  0.0047  12.4  4.75e-35 0.0489  0.0673  
bedrooms0.0488  0.00216 22.6  1.72e-1110.0446  0.0531  
accommodates0.0376  0.00118 31.9  1.11e-2180.0353  0.0399  
model3 %>% glance()
r.squaredadj.r.squaredsigmastatisticp.valuedflogLikAICBICdeviancedf.residualnobs
0.4130.4120.1681.35e+030128.36e+03-1.67e+04-1.66e+046542304723060
# Test VIF
car::vif(model3)
##                          GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.370590  4        1.040192
## number_of_reviews    1.029562  1        1.014673
## review_scores_rating 1.011882  1        1.005924
## room_type            1.271896  3        1.040899
## bathrooms            1.332684  1        1.154419
## bedrooms             2.862720  1        1.691957
## accommodates         2.929976  1        1.711717
# Check residuals
autoplot(model3)

Model 3: 41%

Model 4

Do superhosts command a pricing premium, after controlling for other variables?

# Create model4 by adding host_is_superhost to model3
model4<-lm(price_4_nights_log ~
             prop_type_simplified +
             number_of_reviews +
             review_scores_rating +
             room_type + 
             bathrooms +
             bedrooms +
             accommodates +
             host_is_superhost,
            data = listings_log
           )
model4 %>% tidy(conf.int=TRUE)
termestimatestd.errorstatisticp.valueconf.lowconf.high
(Intercept)3.1     0.0164  189   0        3.07    3.13    
prop_type_simplifiedCondominium0.00988 0.00452 2.190.0287   0.00103 0.0187  
prop_type_simplifiedHouse-0.0478  0.00592 -8.077.47e-16 -0.0594  -0.0362  
prop_type_simplifiedOther0.089   0.00708 12.6 4.07e-36 0.0751  0.103   
prop_type_simplifiedTownhouse0.00141 0.00833 0.170.865    -0.0149  0.0177  
number_of_reviews0.0004844.01e-0512.1 2.24e-33 0.0004050.000562
review_scores_rating0.00145 0.0001658.811.28e-18 0.00113 0.00178 
room_typeHotel room0.226   0.0301  7.496.91e-14 0.167   0.285   
room_typePrivate room-0.183   0.00338 -54.1 0        -0.189   -0.176   
room_typeShared room-0.261   0.0216  -12.1 2.06e-33 -0.304   -0.219   
bathrooms0.0582  0.00469 12.4 3.14e-35 0.049   0.0673  
bedrooms0.0491  0.00216 22.7 2.89e-1130.0449  0.0534  
accommodates0.0374  0.00118 31.8 9.41e-2170.0351  0.0397  
host_is_superhostTRUE0.0354  0.00358 9.885.54e-23 0.0284  0.0424  
model4 %>% glance()
r.squaredadj.r.squaredsigmastatisticp.valuedflogLikAICBICdeviancedf.residualnobs
0.4150.4150.1681.26e+030138.4e+03-1.68e+04-1.67e+046512304023054
# Test VIF
car::vif(model4)
##                          GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.371707  4        1.040298
## number_of_reviews    1.122623  1        1.059539
## review_scores_rating 1.028858  1        1.014326
## room_type            1.277660  3        1.041684
## bathrooms            1.332709  1        1.154430
## bedrooms             2.863006  1        1.692042
## accommodates         2.930492  1        1.711868
## host_is_superhost    1.122152  1        1.059317
# Check residuals
autoplot(model4)

Interpretation of superhost predictor:

Since the p-value for the estimate of the superhost coefficient is significant (with a p-value of 0.36 and a t-statistic of 0.9), it seems that superhosts do command a price premium from their guests in Copenhagen. Compared to non-superhost, being a superhost is associated with 0.04 increase in log10(price_4_nights).

Since the p-value for the estimate of the superhost coefficient is significant (with a p-value of 0.36 and a t-statistic of 0.9), it seems that superhosts do command a price premium from their guests in Copenhagen. Compared to non-superhost, being a superhost is associated with 0.03 increase in log10(price_4_nights).

# Create model4b by adding host_is_superhost to model4 and create an interaction variable 
model4b<-lm(price_4_nights_log ~
            prop_type_simplified +
             number_of_reviews +
             review_scores_rating +
             room_type + 
             bathrooms +
             bedrooms +
             accommodates +
             host_is_superhost +
             host_is_superhost*review_scores_rating,
             data = listings_log
            )
model4b %>% tidy(conf.int=TRUE)
termestimatestd.errorstatisticp.valueconf.lowconf.high
(Intercept)3.11    0.0165  188   0        3.07    3.14    
prop_type_simplifiedCondominium0.00979 0.00452 2.170.0301   0.0009450.0186  
prop_type_simplifiedHouse-0.0477  0.00592 -8.068.12e-16 -0.0593  -0.0361  
prop_type_simplifiedOther0.0891  0.00708 12.6 3.59e-36 0.0752  0.103   
prop_type_simplifiedTownhouse0.0015  0.00833 0.180.857    -0.0148  0.0178  
number_of_reviews0.0004894.02e-0512.2 5.72e-34 0.00041 0.000568
review_scores_rating0.00141 0.0001668.453.12e-17 0.00108 0.00173 
room_typeHotel room0.228   0.0301  7.554.39e-14 0.169   0.287   
room_typePrivate room-0.183   0.00338 -54.1 0        -0.19    -0.176   
room_typeShared room-0.261   0.0216  -12.1 1.78e-33 -0.304   -0.219   
bathrooms0.058   0.00469 12.4 5.16e-35 0.0488  0.0672  
bedrooms0.0491  0.00216 22.7 2.95e-1130.0449  0.0534  
accommodates0.0374  0.00118 31.8 5.54e-2170.0351  0.0397  
host_is_superhostTRUE-0.242   0.122   -1.990.0471   -0.481   -0.00306 
review_scores_rating:host_is_superhostTRUE0.00285 0.00125 2.280.0228   0.0003960.0053  
model4b %>% glance()
r.squaredadj.r.squaredsigmastatisticp.valuedflogLikAICBICdeviancedf.residualnobs
0.4150.4150.1681.17e+030148.41e+03-1.68e+04-1.67e+046512303923054
# Test VIF
car::vif(model4b)
##                                               GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified                      1.371877  4        1.040314
## number_of_reviews                         1.126391  1        1.061316
## review_scores_rating                      1.046224  1        1.022851
## room_type                                 1.278764  3        1.041834
## bathrooms                                 1.333108  1        1.154603
## bedrooms                                  2.863011  1        1.692043
## accommodates                              2.930623  1        1.711906
## host_is_superhost                      1301.822960  1       36.080784
## review_scores_rating:host_is_superhost 1301.627193  1       36.078071
# Check residuals
autoplot(model4b)

Interaction variable host_is_superhost*review_scores_rating is not significant, therefore we do not include it.

Model 4: 41%

Model 5

Most owners advertise the exact location of their listing (is_location_exact == TRUE), while a non-trivial proportion don’t. After controlling for other variables, is a listing’s exact location a significant predictor of price_4_nights?

# Create model5 by adding is_location_exact == TRUE to model4
model5<-lm(price_4_nights_log ~
             prop_type_simplified +
             number_of_reviews +
             review_scores_rating +
             room_type +
             bathrooms +
             bedrooms +
             accommodates +
             host_is_superhost +
             is_location_exact,
             data = listings_log
           )
model5 %>% tidy(conf.int=TRUE)
termestimatestd.errorstatisticp.valueconf.lowconf.high
(Intercept)3.11    0.0165  188    0        3.07    3.14    
prop_type_simplifiedCondominium0.00975 0.00452 2.16 0.0308   0.0009030.0186  
prop_type_simplifiedHouse-0.0478  0.00592 -8.07 7.27e-16 -0.0594  -0.0362  
prop_type_simplifiedOther0.0884  0.00709 12.5  1.36e-35 0.0745  0.102   
prop_type_simplifiedTownhouse0.00103 0.00833 0.1230.902    -0.0153  0.0174  
number_of_reviews0.0004854.01e-0512.1  1.71e-33 0.0004060.000563
review_scores_rating0.00146 0.0001658.85 9.22e-19 0.00114 0.00178 
room_typeHotel room0.227   0.0301  7.52 5.6e-14  0.168   0.286   
room_typePrivate room-0.183   0.00338 -54.1  0        -0.19    -0.176   
room_typeShared room-0.261   0.0216  -12.1  2.35e-33 -0.303   -0.218   
bathrooms0.0582  0.00469 12.4  2.72e-35 0.049   0.0674  
bedrooms0.0491  0.00216 22.7  2.79e-1130.0449  0.0534  
accommodates0.0374  0.00118 31.8  3.56e-2170.0351  0.0397  
host_is_superhostTRUE0.0356  0.00358 9.95 2.93e-23 0.0286  0.0426  
is_location_exactTRUE-0.00547 0.00277 -1.97 0.0484   -0.0109  -3.73e-05
model5 %>% glance()
r.squaredadj.r.squaredsigmastatisticp.valuedflogLikAICBICdeviancedf.residualnobs
0.4150.4150.1681.17e+030148.41e+03-1.68e+04-1.67e+046512303923054
# Test VIF
car::vif(model5)
##                          GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.375094  4        1.040619
## number_of_reviews    1.122768  1        1.059607
## review_scores_rating 1.029247  1        1.014518
## room_type            1.278044  3        1.041736
## bathrooms            1.332750  1        1.154448
## bedrooms             2.863006  1        1.692042
## accommodates         2.931407  1        1.712135
## host_is_superhost    1.123547  1        1.059975
## is_location_exact    1.005398  1        1.002695
# Check residuals
autoplot(model5)

A listing’s exact location variable has a p value of 0.34, and therefore is not a significant predictor of price_4_nights. Thus, we will not include it in the following models.

Model 4: 41%

Model 6

In order to reduce the number of neighbourhoods for our analysis we cluster the different neighbourhoods into groups, based on our experience, talking to locals, and research.

The city of Copenhagen has 10 official administrative districts. Additionally, there is Frederiksberg, which is an independent municipality and, thus, separate from the Copenhagen Municipality, however, it is still part of Copenhagen city.

Indre By is the city center of Copenhagen and will remain its own cluster “Center”.

All the residential neighbourhoods surrounding the city center are commonly grouped together, and since they all end in -bro, often referred to as -bro districts (Brokvaterer in Danish). Frederiksberg is often also included in the -bro districts. Hence, we will cluster these 4 Brokvaterer districts into one group of Brokvarterer

Additionally, the two Amager districts (Vest and Ost) are grouped together.

In the West, Valby, Vanlose and Brønshøj-Husum are summarized.

Lastly, Bispebjerg, often referred to as Nordvest, which is more of a residential neighbourhood, is its own cluster.

# Create variable neighbourhood_simplified with 5 categories 
listings_log <- listings_log %>%
  mutate(neighbourhood_simplified = case_when(
    neighbourhood_cleansed %in% c("Indre By") ~ "Center",
    neighbourhood_cleansed %in% c("Frederiksberg","Nrrebro", "sterbro","Vesterbro-Kongens Enghave") ~ "Brokvarterer", 
    neighbourhood_cleansed %in% c("Amager st","Amager Vest") ~ "Amager",
    neighbourhood_cleansed %in% c("Brnshj-Husum","Valby", "Vanlose") ~ "West",
    TRUE ~ "Nordvest"
  ))

# Create model6 by adding neighbourhood_simplified to model3
model6<-lm(price_4_nights_log ~
            prop_type_simplified +
             number_of_reviews +
             review_scores_rating +
             room_type +
             bathrooms +
             bedrooms +
             accommodates +
             neighbourhood_simplified +
             host_is_superhost+
             NULL,
             data = listings_log
           )
model6 %>% tidy(conf.int=TRUE)
termestimatestd.errorstatisticp.valueconf.lowconf.high
(Intercept)3.1     0.0154  202   0        3.07    3.13    
prop_type_simplifiedCondominium0.0143  0.00418 3.420.000622 0.00611 0.0225  
prop_type_simplifiedHouse0.0189  0.0057  3.320.000915 0.00773 0.0301  
prop_type_simplifiedOther0.091   0.00657 13.9 1.83e-43 0.0781  0.104   
prop_type_simplifiedTownhouse0.0141  0.00775 1.820.0689   -0.00109 0.0293  
number_of_reviews0.0002313.73e-056.2 5.92e-10 0.0001580.000304
review_scores_rating0.00132 0.0001538.617.6e-18  0.00102 0.00162 
room_typeHotel room0.139   0.0279  4.976.9e-07  0.0839  0.193   
room_typePrivate room-0.175   0.00313 -55.9 0        -0.181   -0.169   
room_typeShared room-0.246   0.02    -12.3 1.46e-34 -0.285   -0.207   
bathrooms0.0541  0.00434 12.5 1.51e-35 0.0456  0.0626  
bedrooms0.0466  0.002   23.3 1.26e-1180.0427  0.0505  
accommodates0.0364  0.00109 33.4 1.2e-238 0.0343  0.0385  
neighbourhood_simplifiedBrokvarterer0.0274  0.00295 9.291.63e-20 0.0216  0.0331  
neighbourhood_simplifiedCenter0.161   0.00383 42.1 0        0.154   0.169   
neighbourhood_simplifiedNordvest-0.0911  0.00451 -20.2 3.76e-90 -0.0999  -0.0823  
neighbourhood_simplifiedWest-0.0836  0.005   -16.7 1.71e-62 -0.0934  -0.0738  
host_is_superhostTRUE0.0311  0.00331 9.4 5.85e-21 0.0246  0.0376  
model6 %>% glance()
r.squaredadj.r.squaredsigmastatisticp.valuedflogLikAICBICdeviancedf.residualnobs
0.50.4990.1551.35e+030171.02e+04-2.04e+04-2.02e+045572303623054
# Test VIF
car::vif(model6)
##                              GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified     1.501568  4        1.052127
## number_of_reviews        1.136203  1        1.065928
## review_scores_rating     1.030892  1        1.015329
## room_type                1.286594  3        1.042894
## bathrooms                1.333482  1        1.154765
## bedrooms                 2.867499  1        1.693369
## accommodates             2.937274  1        1.713848
## neighbourhood_simplified 1.151978  4        1.017842
## host_is_superhost        1.123124  1        1.059775
# Check residuals
autoplot(model6)

One can see that the neighbourhood of an apartment is an predictor of the price for 4 nights in Copenhagen.

Interpretation of neighbourhood predictor:

We can see that all the neighbourhood variables are significant predictors of the price for 4 nights in Copenhagen. Taking Amager as the base, Brokvarterer is asscociated with 0.03 more in log10(price_4_nights); Center is asscociated with 0.16 more in log10(price_4_nights); Nordvest is asscociated with 0.09 drop in log10(price_4_nights); West is asscociated with 0.09 drop in log10(price_4_nights).

Model 6: 50%

Model 7

What is the effect of cancellation_policy on price_4_nights, after we control for other variables?

# Create model7 by adding cancellation_policy to model6
model7<-lm(price_4_nights_log ~
            prop_type_simplified +
             number_of_reviews +
             review_scores_rating +
             room_type +
             bathrooms +
             bedrooms +
             accommodates +
             neighbourhood_simplified +
             cancellation_policy+
             host_is_superhost+NULL,
             data = listings_log
           )
model7 %>% tidy(conf.int=TRUE)
termestimatestd.errorstatisticp.valueconf.lowconf.high
(Intercept)3.09   0.0153  202   0        3.06    3.12    
prop_type_simplifiedCondominium0.0124 0.00415 2.990.00284  0.00425 0.0205  
prop_type_simplifiedHouse0.0189 0.00566 3.350.000822 0.00784 0.03    
prop_type_simplifiedOther0.0734 0.00712 10.3 6.64e-25 0.0595  0.0874  
prop_type_simplifiedTownhouse0.0125 0.0077  1.630.104    -0.00257 0.0276  
number_of_reviews0.000173.73e-054.574.99e-06 9.73e-050.000244
review_scores_rating0.001310.0001528.617.72e-18 0.00101 0.00161 
room_typeHotel room0.175  0.0279  6.293.18e-10 0.121   0.23    
room_typePrivate room-0.17   0.00312 -54.7 0        -0.177   -0.164   
room_typeShared room-0.244  0.0199  -12.3 1.19e-34 -0.283   -0.206   
bathrooms0.0532 0.00431 12.3 7.16e-35 0.0447  0.0616  
bedrooms0.0455 0.00199 22.9 1.16e-1140.0416  0.0494  
accommodates0.0359 0.00108 33.2 1.05e-2350.0338  0.038   
neighbourhood_simplifiedBrokvarterer0.0266 0.00292 9.119.1e-20  0.0209  0.0324  
neighbourhood_simplifiedCenter0.158  0.00381 41.5 0        0.151   0.166   
neighbourhood_simplifiedNordvest-0.0898 0.00447 -20.1 5.59e-89 -0.0986  -0.0811  
neighbourhood_simplifiedWest-0.0805 0.00496 -16.2 6.35e-59 -0.0903  -0.0708  
cancellation_policymoderate0.0269 0.00241 11.2 7.93e-29 0.0222  0.0316  
cancellation_policystrict_14_with_grace_period0.0457 0.00262 17.4 9.54e-68 0.0405  0.0508  
cancellation_policysuper_strict_600.127  0.0173  7.361.88e-13 0.0935  0.161   
host_is_superhostTRUE0.0294 0.00329 8.944.11e-19 0.023   0.0359  
model7 %>% glance()
r.squaredadj.r.squaredsigmastatisticp.valuedflogLikAICBICdeviancedf.residualnobs
0.5070.5070.1541.19e+030201.04e+04-2.07e+04-2.05e+045482303323054
# Test VIF
car::vif(model7)
##                              GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified     1.793107  4        1.075724
## number_of_reviews        1.153751  1        1.074128
## review_scores_rating     1.037877  1        1.018762
## room_type                1.312570  3        1.046374
## bathrooms                1.335585  1        1.155675
## bedrooms                 2.871572  1        1.694571
## accommodates             2.943557  1        1.715680
## neighbourhood_simplified 1.161450  4        1.018885
## cancellation_policy      1.268876  3        1.040487
## host_is_superhost        1.124756  1        1.060545
# Check residuals
autoplot(model7)

The cancellation policy is a significant predictor on price, with a flexible policy commanding the lowest price and with a strict policy commanding the highest price. This is surprising since usually hosts should be rewarded for the risk that they are taking with a flexible cancellation policy. On the other hand, usually expensive and nice places have a stricter cancellation policy since the cost of having the place empty for a night is relatively higher. Thus, in the end the cancellation policy might rather be a consequence of the price and not a predictor of it.

Model 7: 51%

PREDICTION

Selecting the best model

After creating all these models, we now find to find our best model, our final model for predictions.

Comparison of models
(1)(2)(3)(4)(5)(6)(7)
(Intercept)3.244 3.356 3.083 3.102 3.106 3.102 3.086 
(0.020)(0.018)(0.016)(0.016)(0.016)(0.015)(0.015)
prop_type_simplifiedCondominium0.033 0.017 0.011 0.010 0.010 0.014 0.012 
(0.006)(0.005)(0.005)(0.005)(0.005)(0.004)(0.004)
prop_type_simplifiedHouse0.119 0.126 -0.048 -0.048 -0.048 0.019 0.019 
(0.007)(0.006)(0.006)(0.006)(0.006)(0.006)(0.006)
prop_type_simplifiedOther0.178 0.168 0.090 0.089 0.088 0.091 0.073 
(0.009)(0.008)(0.007)(0.007)(0.007)(0.007)(0.007)
prop_type_simplifiedTownhouse0.177 0.166 0.002 0.001 0.001 0.014 0.013 
(0.010)(0.009)(0.008)(0.008)(0.008)(0.008)(0.008)
number_of_reviews0.000 0.001 0.001 0.000 0.000 0.000 0.000 
(0.000)(0.000)(0.000)(0.000)(0.000)(0.000)(0.000)
review_scores_rating0.002 0.002 0.002 0.001 0.001 0.001 0.001 
(0.000)(0.000)(0.000)(0.000)(0.000)(0.000)(0.000)
room_typeHotel room     0.112 0.223 0.226 0.227 0.139 0.175 
     (0.034)(0.030)(0.030)(0.030)(0.028)(0.028)
room_typePrivate room     -0.271 -0.181 -0.183 -0.183 -0.175 -0.170 
     (0.004)(0.003)(0.003)(0.003)(0.003)(0.003)
room_typeShared room     -0.362 -0.262 -0.261 -0.261 -0.246 -0.244 
     (0.025)(0.022)(0.022)(0.022)(0.020)(0.020)
bathrooms          0.058 0.058 0.058 0.054 0.053 
          (0.005)(0.005)(0.005)(0.004)(0.004)
bedrooms          0.049 0.049 0.049 0.047 0.045 
          (0.002)(0.002)(0.002)(0.002)(0.002)
accommodates          0.038 0.037 0.037 0.036 0.036 
          (0.001)(0.001)(0.001)(0.001)(0.001)
host_is_superhostTRUE               0.035 0.036 0.031 0.029 
               (0.004)(0.004)(0.003)(0.003)
is_location_exactTRUE                    -0.005           
                    (0.003)          
neighbourhood_simplifiedBrokvarterer                         0.027 0.027 
                         (0.003)(0.003)
neighbourhood_simplifiedCenter                         0.161 0.158 
                         (0.004)(0.004)
neighbourhood_simplifiedNordvest                         -0.091 -0.090 
                         (0.005)(0.004)
neighbourhood_simplifiedWest                         -0.084 -0.081 
                         (0.005)(0.005)
cancellation_policymoderate                              0.027 
                              (0.002)
cancellation_policystrict_14_with_grace_period                              0.046 
                              (0.003)
cancellation_policysuper_strict_60                              0.127 
                              (0.017)
#observations23076     23076     23060     23054     23054     23054     23054     
R squared0.046 0.243 0.413 0.415 0.415 0.500 0.507 
Adj. R Squared0.046 0.243 0.412 0.415 0.415 0.499 0.507 
Residual SE0.215 0.191 0.168 0.168 0.168 0.155 0.154 

Model 7 is the best model with the highest R2 so this will be the final model.

Final model

Let’s print the final model (no. 7).

final_model <- lm(price_4_nights_log ~
                  prop_type_simplified +
                  number_of_reviews +
                  review_scores_rating +
                  room_type +
                  bathrooms +
                  bedrooms +
                  accommodates +
                  neighbourhood_simplified +
                  cancellation_policy
                  +host_is_superhost,
                  data = listings_log
                  )

mosaic::msummary(final_model)
##                                                  Estimate Std. Error t value
## (Intercept)                                     3.086e+00  1.530e-02 201.705
## prop_type_simplifiedCondominium                 1.239e-02  4.149e-03   2.985
## prop_type_simplifiedHouse                       1.893e-02  5.658e-03   3.346
## prop_type_simplifiedOther                       7.342e-02  7.116e-03  10.318
## prop_type_simplifiedTownhouse                   1.251e-02  7.695e-03   1.626
## number_of_reviews                               1.705e-04  3.733e-05   4.566
## review_scores_rating                            1.310e-03  1.522e-04   8.611
## room_typeHotel room                             1.755e-01  2.789e-02   6.293
## room_typePrivate room                          -1.705e-01  3.118e-03 -54.671
## room_typeShared room                           -2.445e-01  1.988e-02 -12.298
## bathrooms                                       5.315e-02  4.308e-03  12.339
## bedrooms                                        4.547e-02  1.987e-03  22.888
## accommodates                                    3.592e-02  1.083e-03  33.169
## neighbourhood_simplifiedBrokvarterer            2.663e-02  2.924e-03   9.107
## neighbourhood_simplifiedCenter                  1.581e-01  3.813e-03  41.464
## neighbourhood_simplifiedNordvest               -8.984e-02  4.472e-03 -20.087
## neighbourhood_simplifiedWest                   -8.053e-02  4.961e-03 -16.232
## cancellation_policymoderate                     2.692e-02  2.413e-03  11.156
## cancellation_policystrict_14_with_grace_period  4.567e-02  2.617e-03  17.449
## cancellation_policysuper_strict_60              1.274e-01  1.730e-02   7.361
## host_is_superhostTRUE                           2.941e-02  3.289e-03   8.942
##                                                Pr(>|t|)    
## (Intercept)                                     < 2e-16 ***
## prop_type_simplifiedCondominium                0.002838 ** 
## prop_type_simplifiedHouse                      0.000822 ***
## prop_type_simplifiedOther                       < 2e-16 ***
## prop_type_simplifiedTownhouse                  0.103997    
## number_of_reviews                              4.99e-06 ***
## review_scores_rating                            < 2e-16 ***
## room_typeHotel room                            3.18e-10 ***
## room_typePrivate room                           < 2e-16 ***
## room_typeShared room                            < 2e-16 ***
## bathrooms                                       < 2e-16 ***
## bedrooms                                        < 2e-16 ***
## accommodates                                    < 2e-16 ***
## neighbourhood_simplifiedBrokvarterer            < 2e-16 ***
## neighbourhood_simplifiedCenter                  < 2e-16 ***
## neighbourhood_simplifiedNordvest                < 2e-16 ***
## neighbourhood_simplifiedWest                    < 2e-16 ***
## cancellation_policymoderate                     < 2e-16 ***
## cancellation_policystrict_14_with_grace_period  < 2e-16 ***
## cancellation_policysuper_strict_60             1.88e-13 ***
## host_is_superhostTRUE                           < 2e-16 ***
## 
## Residual standard error: 0.1543 on 23033 degrees of freedom
##   (5469 observations deleted due to missingness)
## Multiple R-squared:  0.5072, Adjusted R-squared:  0.5068 
## F-statistic:  1185 on 20 and 23033 DF,  p-value: < 2.2e-16
get_regression_table(final_model)
termestimatestd_errorstatisticp_valuelower_ciupper_ci
intercept3.09 0.015202   0    3.06 3.12 
prop_type_simplifiedCondominium0.0120.0042.980.0030.0040.021
prop_type_simplifiedHouse0.0190.0063.350.0010.0080.03 
prop_type_simplifiedOther0.0730.00710.3 0    0.0590.087
prop_type_simplifiedTownhouse0.0130.0081.630.104-0.0030.028
number_of_reviews0    0    4.570    0    0    
review_scores_rating0.0010    8.610    0.0010.002
room_typeHotel room0.1750.0286.290    0.1210.23 
room_typePrivate room-0.17 0.003-54.7 0    -0.177-0.164
room_typeShared room-0.2440.02 -12.3 0    -0.283-0.206
bathrooms0.0530.00412.3 0    0.0450.062
bedrooms0.0450.00222.9 0    0.0420.049
accommodates0.0360.00133.2 0    0.0340.038
neighbourhood_simplifiedBrokvarterer0.0270.0039.110    0.0210.032
neighbourhood_simplifiedCenter0.1580.00441.5 0    0.1510.166
neighbourhood_simplifiedNordvest-0.09 0.004-20.1 0    -0.099-0.081
neighbourhood_simplifiedWest-0.0810.005-16.2 0    -0.09 -0.071
cancellation_policymoderate0.0270.00211.2 0    0.0220.032
cancellation_policystrict_14_with_grace_period0.0460.00317.4 0    0.0410.051
cancellation_policysuper_strict_600.1270.0177.360    0.0930.161
host_is_superhostTRUE0.0290.0038.940    0.0230.036
get_regression_summaries(final_model)
r_squaredadj_r_squaredmsermsesigmastatisticp_valuedfnobs
0.5070.5070.02380.1540.1541.19e+030202.31e+04
vif(final_model)
##                              GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified     1.793107  4        1.075724
## number_of_reviews        1.153751  1        1.074128
## review_scores_rating     1.037877  1        1.018762
## room_type                1.312570  3        1.046374
## bathrooms                1.335585  1        1.155675
## bedrooms                 2.871572  1        1.694571
## accommodates             2.943557  1        1.715680
## neighbourhood_simplified 1.161450  4        1.018885
## cancellation_policy      1.268876  3        1.040487
## host_is_superhost        1.124756  1        1.060545

Analysis of Coefficients in Final Model

  • Property Type Apartments have the lowest price. Compared to an apartment, a town house has a 1.2% higher price, a house a 1.9% higher price and a condominium a 1.2% higher price.“Other” property types have a 7.6% higher price than apartments, which makes sense since it includes special properties such as house boats etc.

  • Number of Reviews Already one additional review on AirBnB increases the price of the place 0.02%

  • Review_scores_rating For every increase in the review score rating, the price rises by 0.13%

  • Room Type A shared room and private room will have the lowest prices, with 21.69% and 15.6% respectively less than an entire home. This is intuitive since you have to share rooms and/or share facilities, such as bathroom and kitchen with other people. Hotel rooms have the highest price with around 19% more than an entire home.

  • Bathrooms & Bedrooms The number of bathrooms and bedrooms also have an influence on the price. An additional bathroom will increase the price by 5.4% and an additional bedroom by 4.6%. (both by roughly 5%).

  • Accommodates Offering an additional accommodate rises the price by 3.6%

  • Superhost Being a super host can increase your price by 2.7%, so it might be worth it to make to superhost!

  • Neighbourhoods Looking at the neighbourhoods, we can also see clear trends there. Highest prices are found in the center, which is the case in most cities. There are 17.2% higher than in Amager. Prices in Brokvarterer are 2.7% higher than in Amager. The lowest prices are in Nordvest and the West (8.6% and 7.7% lower than in Amager), which are more residential areas.

  • Cancellation Policy Places with flexible cancellation policies have the lowest prices, those with moderate ones are 2.7% higher and those with strict ones 4.9%. The super strict ones have the highest prices with 13.5%

  • In conclusion, this model can help AirBnB hosts understand what drives AirBnB prices, e.g. which in neighborhoods one can demand higher prices, and learn which features customer accept to pay more for .

Prediction

We are planning to visit Copenhagen and want to stay in an AirBnB. We would like to look at AirBnB’s that are apartments with a private room, have at least 10 reviews, and an average rating of at least 90.

We will predict the total cost to stay at this AirBnB for 4 nights, including the appropriate 95% interval with the prediction.

# Filter our the data by room_type, nr of reviews and rating
listings_predict_log <- listings_log %>% 
  filter(room_type == "Private room",
         number_of_reviews >= 10,
         review_scores_rating >= 90,
         accommodates >= 2
         ) 

# Generate a prediction for each row
model_predictions_log_CI <- predict(final_model, newdata = listings_predict_log, interval = "confidence")

# Generate a prediction for each row with broom
model_predictions_log <- broom::augment(final_model,
                             newdata = listings_predict_log, se_fit = TRUE)


# Convert prices back from log
model_predictions <- model_predictions_log  %>%
  mutate(lower = .fitted - 1.96 * .se.fit, 
         upper = .fitted + 1.96 * .se.fit) %>%
  mutate(fitted_price_final = 10^(.fitted),
         upper_final = 10^(upper),
         lower_final = 10^(lower))

Next we visualise the data to have a better understanding of the structure.

# Change order of property types to have other in the end
model_predictions$prop_type_simplified <- factor(model_predictions$prop_type_simplified, c("Apartment", "Condominium", "House", "Townhouse", "Other"))

# Graph distribution of fitted values  
ggplot(model_predictions, aes(x = fitted_price_final)) +
  geom_density() +
  labs(y = "",
       x = "Price of an AirBnB in Copenhagen") +
  theme_classic() +
  NULL

# Facet by neighborhood
ggplot(model_predictions, aes(x = fitted_price_final)) +
  geom_density() +
  labs(y = "",
       x = "Price of an AirBnB in Copenhagen") +
  facet_wrap(~neighbourhood_simplified) +
  theme_classic() +
  NULL

# Facet by prop_type_simplified
ggplot(model_predictions, aes(x = fitted_price_final)) +
  geom_density() +
  labs(y = "",
       x = "Price of an AirBnB in Copenhagen") +
  facet_wrap(~prop_type_simplified) +
  theme_classic() +
  NULL

# Facet by neighborhood & prop_type_simplified
ggplot(model_predictions, aes(x = fitted_price_final)) + 
  geom_density() +
  labs(y = "",
       x = "Price of an AirBnB in Copenhagen") +
  facet_grid(
    neighbourhood_simplified ~ prop_type_simplified,
    scales = "free"
    ) +
  theme_classic() +
  NULL

# Splitting data in training and testing sets
library(rsample)
## Warning: package 'rsample' was built under R version 3.6.2
set.seed(1234)  # for reproducibility, and to always get the same split, set the random seed first

train_test_split <- initial_split(listings_log, prop = 0.80)
listings_train <- training(train_test_split)
listings_test <- testing(train_test_split)
 
rmse_train <- listings_train %>%
  mutate(predictions = predict(final_model, .)) %>%
  summarise(sqrt(sum(predictions - price_4_nights_log,na.rm = TRUE)**2/n())) %>% 
   pull()

rmse_train
## [1] 0.01582266
rmse_test <- listings_test %>%
  mutate(predictions = predict(final_model, .)) %>%
  summarise(sqrt(sum(predictions - price_4_nights_log,na.rm = TRUE)**2/n())) %>% 
  pull()

rmse_test
## [1] 0.0316474
# Find mean and median of the fitted prices
model_predictions %>%
  summarize(mean_price = mean(fitted_price_final, na.rm = TRUE), 
            median_price = median(fitted_price_final, na.rm = TRUE)
            )
mean_pricemedian_price
1.98e+031.9e+03

As the price distribution is right skewed, choosing mean as measure to predict the final price wouldn’t be economical so we’ve decided to use the median because it fits our budget.

# locate the median value property
predicted_value <- model_predictions %>%
  filter(fitted_price_final == median(fitted_price_final, na.rm = TRUE))

predicted_value %>% 
  select(listing_url, accommodates, price, cleaning_fee, extra_people, prop_type_simplified, room_type, review_scores_rating, bathrooms, bedrooms, beds, neighbourhood_cleansed)
listing_urlaccommodatespricecleaning_feeextra_peopleprop_type_simplifiedroom_typereview_scores_ratingbathroomsbedroomsbedsneighbourhood_cleansed
https://www.airbnb.com/rooms/113391172452350200ApartmentPrivate room98111Amager Vest
# print the point estimates and their confidence intervals of these median properties 
predicted_value[, 29:31] 
fitted_price_finalupper_finallower_final
1.9e+031.94e+031.86e+03

Conclusion

To conclude, we can predict, with 95% confidence, that a median listing on AirBnB in Copenhagen, for 2 people and 4 nights, will cost between 1,860.18 DKK and 1,938.58 DKK.